# Synthetic data
## File list
- **README.md** describes the content of this package.
- **requirement.txt** lists the required Python libraries.
- **synthic_data.py** contains the code for the data generating process.
- **config/synth_binary.conf** is an example configuration file for `synthic_data.py`.
## Set up environment
``` sh
pip install -r requirement.txt
```
## Example usage
``` sh
python synthic_data.py 
```
or
``` python
from synthic_data import FeatGenerator
from pyhocon import HOCONConverter, ConfigFactory

# load the configuration file
conf = ConfigFactory.parse_file('config/synth_binary.conf')
# you can also set an arbitrary bias ratio
conf.put('$.bias_ratio', 0.9)
fg = FeatGenerator(conf)
# generate 100000 instances from 2 clusters
fg.generate2(n_samples=100000, n_clusters=2)

```

## Detailed implementation of the data generating process
We start by generating a $p$-dimensional confounder vector $X$ from the mixture $\sum_{s=1}^S \mathrm{I}{(G=s)}\mathcal{N}(\mu_s,\Sigma_s)$, where $G\in \{1,2,\cdots,S\}$ is a hidden variable that follows a predetermined discrete distribution and indicates which latent cluster each subject belongs to. In our implementation, this discrete distribution is binomial with parameters $N=2$ and $\phi$. The treatment-independent feature vector $W$, also of dimension $p$, is generated similarly to $X$. We then generate the potential outcomes by the following factor-augmented autoregressive process of order one: for each $1\leq k \leq m$,
$$
\begin{aligned}
    &Y_{t_k}^{(0)} = \rho Y_{t_{k-1}}^{(0)}+(1-\rho) \{\alpha(X,W)+f(W,\lambda_t)+\varepsilon\}, \quad
    Y_{t_m}^{(1)} = Y_{t_m}^{(0)}+g(X)+\varepsilon_{1},
\end{aligned}
$$
where $\rho\in[0,1]$ is the individual-specific autocorrelation coefficient, drawn from a truncated normal distribution $\bar{\mathcal{N}}(\mu_\rho, \sigma_\rho^2;[0,1])$, $\lambda_t$ is the time-varying factor, the initial value is $Y^{(0)}_{t_0}=\beta(X,W)$, $\{\alpha(\cdot), \beta(\cdot), f(\cdot),g(\cdot)\}$ are specified functions, and $\varepsilon,\varepsilon_1$ are mutually independent noise terms distributed as $\mathcal{N}(0,1)$. To ensure that the samples are drawn from the stationary distribution of the data generating process, we use only outcomes at times later than a threshold $t_T$ as observations.
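
The recursion above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code in `synthic_data.py`: the helper names `truncated_normal` and `simulate_outcomes`, the default values of $\mu_\rho$ and $\sigma_\rho$, the burn-in length standing in for the threshold $t_T$, and the precomputed inputs for $\alpha$, $\beta$, $f$ and $g$ are all hypothetical.

```python
import numpy as np

def truncated_normal(rng, mean, std, low, high, size):
    """Sample N(mean, std^2) truncated to [low, high] by simple rejection."""
    out = np.empty(size)
    filled = 0
    while filled < size:
        draw = rng.normal(mean, std, size=size)
        draw = draw[(draw >= low) & (draw <= high)]
        take = min(size - filled, draw.size)
        out[filled:filled + take] = draw[:take]
        filled += take
    return out

def simulate_outcomes(alpha, beta, f_t, g, rng,
                      mu_rho=0.5, sigma_rho=0.2, burn_in=50):
    """Simulate the order-one autoregressive outcome process.

    alpha, beta, g: arrays of shape (n,), already evaluated per unit.
    f_t: array of shape (n_steps, n) holding f(W, lambda_t) at each step.
    Returns (y0, y1_last): control outcomes after the burn-in period,
    shape (n_steps - burn_in, n), and the treated outcome at the last step.
    """
    n_steps, n = f_t.shape
    # One autocorrelation coefficient per unit (assumed defaults).
    rho = truncated_normal(rng, mu_rho, sigma_rho, 0.0, 1.0, n)
    y = beta.copy()  # initial value Y_{t_0}^{(0)} = beta(X, W)
    path = []
    for k in range(n_steps):
        eps = rng.normal(size=n)
        y = rho * y + (1.0 - rho) * (alpha + f_t[k] + eps)
        path.append(y)
    # Keep only steps after the burn-in, mimicking the threshold t_T.
    y0 = np.stack(path)[burn_in:]
    # The treated outcome is defined at the final time point only.
    y1_last = y0[-1] + g + rng.normal(size=n)
    return y0, y1_last
```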
We assign treatment to each unit according to its latent cluster, such that
$P(D=1\mid G=s)-P(D=0 \mid G=s)=(-1)^{s}\phi$ for $s=1,2,\cdots,S$, where $\phi\in [0,1]$ is a specified value. The parameter $\phi$ controls the degree of imbalance between the treated and control populations: as $\phi$ rises from zero to one, the imbalance gradually increases.
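
Because $P(D=1\mid G=s)+P(D=0\mid G=s)=1$, the condition above is equivalent to $P(D=1\mid G=s)=\{1+(-1)^{s}\phi\}/2$. The sketch below uses this to draw latent clusters, confounders and treatments together. It is an illustration only: the function name `sample_clusters_and_treatment` and the generic `cluster_probs` argument are assumptions and may not match how `synthic_data.py` draws $G$.

```python
import numpy as np

def sample_clusters_and_treatment(n, phi, means, covs, cluster_probs, rng):
    """Draw latent clusters G, confounders X and treatments D.

    means: list of S mean vectors; covs: list of S covariance matrices.
    cluster_probs: P(G = s) for s = 1..S (a hypothetical input).
    """
    S = len(means)
    p = len(means[0])
    G = rng.choice(np.arange(1, S + 1), size=n, p=cluster_probs)
    X = np.empty((n, p))
    for s in range(1, S + 1):
        idx = np.flatnonzero(G == s)
        # Units in cluster s are drawn from N(mu_s, Sigma_s).
        X[idx] = rng.multivariate_normal(means[s - 1], covs[s - 1], size=idx.size)
    # P(D=1 | G=s) = (1 + (-1)^s * phi) / 2
    p_treat = (1.0 + (-1.0) ** G * phi) / 2.0
    D = rng.binomial(1, p_treat)
    return G, X, D
```

With `phi=0` every unit is treated with probability 0.5 regardless of cluster; with `phi=1` treatment is fully determined by the cluster, so units with odd $s$ are never treated and units with even $s$ always are.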

The functions $\alpha(\cdot),\beta(\cdot),f(\cdot),g(\cdot)$ are specified as follows.

- The intercept function $\alpha(\cdot)$ is
$
\begin{aligned}
     &\alpha(X,W)=\frac{\alpha_1(X\oplus W) + \alpha_2(X\oplus W)-\mu_\alpha}{\sigma_\alpha}\tilde{\sigma}_\alpha+\tilde{\mu}_\alpha,
\end{aligned}
$
where
$\begin{aligned}
     &\alpha_1(Z) = \frac{1}{p_z-4}\sum_{i=0}^{p_z-4} \big[10\sin(\pi Z_{i} Z_{i+1}) + 20(Z_{i+2} - 0.5)^2 + 10 Z_{i+3} + 5 Z_{i+4}\big],\\
    &\alpha_2(Z) = \frac{1}{p_z-4}\sum_{i=0}^{p_z-4} \sqrt{Z_{i}^2 + \big[Z_{i+1} Z_{i+2}  - 1 / (Z_{i+1} Z_{i+3})\big]^ 2},\quad  Z\in \mathbb{R}^{p_z},
\end{aligned}
$
Here, $\mu_\alpha=E\big[\alpha_1(X\oplus W)+\alpha_2(X\oplus W)\big]$ and $\sigma_\alpha^2=\mathrm{var}\big[\alpha_1(X\oplus W)+\alpha_2(X\oplus W)\big]$ are the mean and variance of the unscaled sum, $X\oplus W=(X^T,W^T)^T$ denotes the concatenation of the two vectors, and $\tilde{\mu}_\alpha=2$ and $\tilde{\sigma}_\alpha=2$ are the predefined mean and standard deviation. See the functions `make_friedman1` and `make_friedman2` in file "synthic_data.py" for implementation details.

- The function $\beta(\cdot)$ is
$
\begin{aligned}
&{\beta}(X,W)=\theta_\beta^T(X\oplus W) - E\big[{\theta}_\beta^T(X\oplus W)\big] + \tilde{\mu}_\beta, \quad \theta_\beta\sim \mathcal{N}_{2p}(0,I_{2p}),
\end{aligned}
$
where $\tilde{\mu}_\beta=2$. See the function `linear_regression` in file "synthic_data.py" for implementation details.

- The time-varying function $f(\cdot)$ is
$
\begin{aligned}
& f(W,\lambda_t) = {(\Theta_f^T W)}^T\lambda_t,
\end{aligned}
$
where $\Theta_f$ is a $p\times 3$ matrix whose entries are independently sampled from a normal distribution $\mathcal{N}(0, 1)$, and $\lambda_t\in \mathbb{R}^3$ denotes the time-varying factor. See the function `time_series` in file "synthic_data.py" for implementation details.

- The effect function $g(\cdot)$ is
$
\begin{aligned}
&{g}({X})=\frac{\alpha_1({X})/10}{\log(|\alpha_2({X})|+1)}.
\end{aligned}
$
See the function `linear_regression` in file "synthic_data.py" for implementation details.
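
The four functions listed above can be sketched with NumPy as follows. This is a hedged illustration rather than the implementation in `synthic_data.py`: the names `alpha1`, `alpha2`, `alpha`, `beta`, `f` and `g` are chosen here to mirror the formulas, sample means and standard deviations stand in for the population moments $\mu_\alpha$, $\sigma_\alpha$ and $E[\theta_\beta^T(X\oplus W)]$, and the sliding-window sums are assumed to run only over windows that fit inside $Z$ (zero-based indexing) while keeping the $1/(p_z-4)$ normalisation.

```python
import numpy as np

def alpha1(Z):
    """Friedman #1 form summed over sliding windows of 5 coordinates. Z: (n, p_z)."""
    p_z = Z.shape[1]
    total = np.zeros(Z.shape[0])
    for i in range(p_z - 4):  # assumption: only windows that fit inside Z
        total += (10 * np.sin(np.pi * Z[:, i] * Z[:, i + 1])
                  + 20 * (Z[:, i + 2] - 0.5) ** 2
                  + 10 * Z[:, i + 3] + 5 * Z[:, i + 4])
    return total / (p_z - 4)

def alpha2(Z):
    """Friedman #2 form summed over sliding windows of 4 coordinates."""
    p_z = Z.shape[1]
    total = np.zeros(Z.shape[0])
    for i in range(p_z - 3):
        inner = Z[:, i + 1] * Z[:, i + 2] - 1.0 / (Z[:, i + 1] * Z[:, i + 3])
        total += np.sqrt(Z[:, i] ** 2 + inner ** 2)
    return total / (p_z - 4)

def alpha(X, W, mu_tilde=2.0, sigma_tilde=2.0):
    """Standardise alpha1 + alpha2, then rescale to the target mean and std."""
    Z = np.hstack([X, W])  # X (+) W: concatenate the feature vectors
    raw = alpha1(Z) + alpha2(Z)
    # Sample moments replace the population mean and variance (assumption).
    return (raw - raw.mean()) / raw.std() * sigma_tilde + mu_tilde

def beta(X, W, rng, mu_tilde=2.0):
    """Centred random linear function of X (+) W, shifted to mean mu_tilde."""
    Z = np.hstack([X, W])
    theta = rng.normal(size=Z.shape[1])  # theta_beta ~ N(0, I_{2p})
    lin = Z @ theta
    return lin - lin.mean() + mu_tilde

def f(W, lam_t, Theta_f):
    """(Theta_f^T W)^T lambda_t for each row of W. W: (n, p), Theta_f: (p, 3)."""
    return (W @ Theta_f) @ lam_t

def g(X):
    """Effect function: (alpha1(X) / 10) / log(|alpha2(X)| + 1)."""
    return (alpha1(X) / 10.0) / np.log(np.abs(alpha2(X)) + 1.0)
```

Each function takes one row per unit, so the returned arrays can be passed directly to a simulation of the outcome process. The input dimension must exceed 4 for the window sums to be defined.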